This IPython notebook outlines some experiments with the static datasets on the Northamptonshire County Council website. It also steps you through using the DataFetcher helper class, written in Python, which provides an API for navigating the site's dataset hierarchy, inspecting dataset metadata, and downloading the datasets themselves.
We begin by importing the packages we need: os and re from the standard library, plus the third-party pandas and matplotlib libraries.
In [30]:
import os
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import re
Then, we import the DataFetcher helper class from the analysis.datafetch.northants module.
In [31]:
from analysis.datafetch.northants import DataFetcher
The DataFetcher class provides methods to access the datasets published by the Northamptonshire County Council. The data appears to be organized in the following hierarchy (sketched in code after the list):
Themes : Broad areas of interest under which the various datasets are organized -- for example, Population & Census. Each theme has a unique id which is used in organizing the namespace for the website; the DataFetcher uses these ids to navigate the site's various parts.
Sub-themes : Further refinements of the individual themes -- for instance, the Age & Gender datasets within the Population & Census theme.
Dataviews : Each data collection is called a DataView. All files with the same dataview id have the same layout (i.e. the same columns), though they may be collected over different regions.
Geo ids & names : As mentioned above, a DataView can have multiple datasets collected over different target areas -- for instance, population data collected at the county, district, or region level.
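To make this hierarchy concrete, here is a minimal sketch of one branch of it as nested Python data. The theme and sub-theme names come from the examples above; the dataview id and its details are invented for illustration and are not taken from the live site:
In [ ]:
# Illustrative (invented) branch of the theme -> sub-theme -> dataview -> geo hierarchy.
hierarchy = {
    'Population & Census': {           # theme (has its own unique id on the site)
        'Age & Gender': {              # sub-theme
            101: {                     # invented dataview id; layout is fixed per id
                'geo_names': ['county', 'district', 'region'],  # areas it covers
            },
        },
    },
}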
First, we need to instantiate a DataFetcher object as follows:
In [39]:
fd = DataFetcher()
This newly created DataFetcher object then needs to be configured before use -- either by loading a previously saved configuration, or by querying the website. The relevant methods for doing this are:
sync_metadata() : Fetches the metadata from the county website. The metadata is needed to navigate through the dataset hierarchy and to organize the data once it is downloaded.
save_metadata(file_name) : Saves the metadata to the given file in JSON format.
load_metadata(file_name) : Loads JSON-encoded metadata that we may have saved previously with save_metadata().
In [40]:
if os.path.exists('metadata.json'):
    fd.load_metadata('metadata.json')
else:
    fd.sync_metadata()
    fd.save_metadata('metadata.json')
Before we can analyse the datasets, we need to know what they contain and how they are organized. The datasets are stored in CSV format, and we need to know the layouts of these CSV files. The DataFetcher helper class provides methods to make this process a bit less painful :-) :
get_dataview_ids() : Returns a list of dataview ids -- unique integers that identify each dataset. These ids can be passed to the fetch_dataview_csv() method described below to fetch the CSV file(s) for a dataview.
get_dataview_ids_with_details() : Returns a Python dict keyed on dataview ids, where each value is another dict describing the dataview. The attributes returned are:
id : The dataview's id.
title : Tells us what the dataset is.
theme : The broad theme under which this dataset belongs.
subtheme : The sub-theme within the larger theme.
description : A more detailed description of the dataset's contents, collection methods, etc.
As the example below shows, the dict returned by get_dataview_ids_with_details() can be loaded directly into a pandas DataFrame.
In [33]:
dataviews = fd.get_dataview_ids_with_details()
dvdf = pd.DataFrame(dataviews).transpose().sort_values(['theme'])
dvdf
Out[33]:
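Since dvdf is an ordinary DataFrame, we can slice it to focus on a single theme. A minimal sketch, assuming the exact theme string 'Population & Census' appears in the listing above:
In [ ]:
# Narrow the listing to one theme; the column names follow the
# attribute names described above.
census_views = dvdf[dvdf['theme'] == 'Population & Census']
census_views[['title', 'subtheme']]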
Once the DataFetcher object has been initialized and the dataset layout is known to us, we can locate and download the CSV files we need. The CSVs can then be loaded directly into pandas DataFrames for subsequent analysis.
The DataFetcher class provides the following methods for downloading data from the server:
download_dataviews(dir_name, geo_name, verbose=False) : Fetches all the dataviews for the geographical unit specified by geo_name (county, district, etc.) and stores them under the dir_name folder, organized by theme & subtheme. The verbose flag, if set, causes the method to print status information as it downloads the data.
download_all_dataviews(dir_name, verbose=False) : Fetches all the dataviews available on the county website and stores them under the dir_name folder, organized by theme & subtheme.
fetch_dataview_csv(dv_id, geo_name, verbose=False) : Fetches the dataview specified by dv_id for the specified geo_name and returns a stream handle to the server response -- this can be used to load a pandas DataFrame, as shown in the example below:
In [34]:
fp1 = fd.fetch_dataview_csv(54, 'county')
fp2 = fd.fetch_dataview_csv(54, 'district')
fp3 = fd.fetch_dataview_csv(54, 'region')
infants_county = pd.read_csv(fp1)
infants_district = pd.read_csv(fp2)
infants_region = pd.read_csv(fp3)
infants_county
Out[34]:
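For bulk downloads, the download_dataviews() method described above can be used instead of fetching views one at a time. A minimal usage sketch -- the 'data' folder name here is purely illustrative:
In [ ]:
# Mirror every county-level dataview into ./data, organized by theme & subtheme.
fd.download_dataviews('data', 'county', verbose=True)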
We can now use pandas and matplotlib to manipulate these DataFrames and visualize the results:
In [35]:
def get_filtered(df, col_regex, region):
    # Extract the year embedded in a column label, or None if the label
    # does not match the given regex.
    def get_filter_years(label):
        m = col_regex.match(label)
        if m is not None:
            return int(m.group(1))
        return None
    # Keep only the columns whose labels carry a year.
    columns = [c for c in df.columns if get_filter_years(c) is not None]
    years = [get_filter_years(c) for c in columns]
    # Select the row for the given region and transpose so the years index the rows.
    filtered_df = df[df['Name'] == region][columns].transpose()
    filtered_df.index = years
    filtered_df.columns = [region]
    return filtered_df
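As a quick sanity check of get_filtered(), the snippet below pulls the yearly live-birth totals for one region; the column-label pattern is an assumption that mirrors the reg_births regex compiled in the next cell:
In [ ]:
# One column per region, indexed by the year extracted from the column labels.
births_re = re.compile(r'^Live Births : Total\((\d+)\).*$')
get_filtered(infants_county, births_re, 'Northamptonshire')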
In [36]:
# Modified from http://matplotlib.org/examples/api/barchart_demo.html
reg_births = re.compile(r'^Live Births : Total\((\d+)\).*$')
reg_deaths = re.compile(r'^Infant mortality rate\((\d+)\).*$')

def make_bar_chart(df, col_regex, ax, width, title, ylabel):
    # One bar series per region, each shifted sideways by `width`.
    north_df = get_filtered(df, col_regex, 'Northamptonshire')
    north_bar = ax.bar(north_df.index, north_df['Northamptonshire'], width, color='r')
    midlands_df = get_filtered(df, col_regex, 'East Midlands')
    midlands_xrange = [x + width for x in north_df.index]
    midlands_bar = ax.bar(midlands_xrange, midlands_df['East Midlands'], width, color='g')
    england_df = get_filtered(df, col_regex, 'England')
    england_xrange = [x + width for x in midlands_xrange]
    england_bar = ax.bar(england_xrange, england_df['England'], width, color='b')
    # Add some text for labels, title and axes ticks.
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    ax.set_xticks([x + 1.5 * width for x in north_df.index])
    ax.set_xticklabels([str(x) for x in north_df.index])
    # Legend
    ax.legend((north_bar[0], midlands_bar[0], england_bar[0]),
              ('Northamptonshire', 'East Midlands', 'England'))
In [37]:
fig, ax = plt.subplots()
width = 0.5 # the width of the bars
fig.set_size_inches(16, 8)
make_bar_chart(df=infants_county,
               col_regex=reg_births,
               ax=ax,
               width=width,
               title='Live Births in Northants compared to East-Mid region & Eng',
               ylabel='Live Births')
ax
Out[37]:
In [38]:
fig, ax = plt.subplots()
width=0.2
fig.set_size_inches(12, 8)
make_bar_chart(df=infants_county,
               col_regex=reg_deaths,
               ax=ax,
               width=width,
               title='Infant Mortality in Northants compared to East-Mid region & Eng',
               ylabel='Infant Mortality (%)')
ax
Out[38]:
The demographics data provided by Northamptonshire County Council is a useful starting point for understanding the community in question. The way the data is organized on the website is of special interest to us, as a guide on how to organize the datasets we collect in our own investigations. This matters because the datasets we collect will originate from multiple data sources & social-media platforms (Facebook, Twitter, LinkedIn, etc.), and will need to be collated across different facets (e.g. professional info, activities, education, etc.). So, some prior thought on organizing these datasets will benefit us in the long run.
The DataFetcher API provides a programmatic interface for accessing the datasets on the council website and downloading them to local storage. Going forward, the DataFetcher (or something equivalent) could be integrated with social-media crawlers and a database backend (e.g. a document store such as MongoDB) to serve as a layer of storage abstraction for any analysis/ML tools we may need to build.
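To make the storage-abstraction idea a bit more concrete, here is a hypothetical sketch of how the fetcher could feed a document store. The pymongo dependency, the archive_dataview() helper, and the collection name are all assumptions, not part of the current codebase:
In [ ]:
import json
import pymongo  # assumed dependency; any document store would do

def archive_dataview(fd, db, dv_id, geo_name):
    """Download one dataview and persist each CSV row as a document."""
    df = pd.read_csv(fd.fetch_dataview_csv(dv_id, geo_name))
    # The to_json/loads round-trip converts numpy scalars into plain JSON types.
    docs = json.loads(df.to_json(orient='records'))
    for doc in docs:
        doc.update(dataview_id=dv_id, geo_name=geo_name)
    db.dataviews.insert_many(docs)

# Usage sketch:
# db = pymongo.MongoClient().northants
# archive_dataview(fd, db, 54, 'county')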
IPython notebooks can be very useful for quickly prototyping any data-science ideas or tricks we may want to exchange, and also as a tool for literate programming. Although this notebook was authored in Python, the Jupyter/IPython project supports other data-centric languages such as R & Julia as well.